1 Introduction: the replicability of the Social Sciences


This paper concerns the prevalence and the causes of low replication rates in the Social Sciences.
The aim is to frame unintentional errors as scientific misinformation, and questionable research
practices as disinformation. Section 3 presents Multiverse Analysis, which helps to assess the
uncertainty about scientific claims and reduces false discoveries.

To introduce the topic of replication rates in Science, it is important to clarify the
epistemological conditions under which a scientific result can be claimed to be replicated:


1. A scientific result consists of a claim A that is deduced through a procedure that can be 
reproduced by a third party (Goodman et al., 2016). A proper scientific result should be 
reported in an authoritative scientific venue, usually a peer-reviewed journal. 

2. Others try to refute A by reproducing the same procedure on a different sample or adopt- 
ing advanced but theoretically coherent alternative procedures on the original sample. 

3. But these attempts fail: new results are not incompatible with A. 


A replicated scientific theory is a collection of connected claims that are, for the most part,
individually replicated (Lakatos, 1976; Schmidt, 2009). A replication rate is the rate of replicated
results given a grouping variable: an author, an institution, or a scientific field. High replication
rates are observed in the exact sciences. Often, these replications are implicit: after a few
successful experiments, a scientific theory is applied to more complex theories or technologies. The
application of a theory is an implicit process of scientific replication (Feigenbaum and Levy,
1996). Methods of the Social Sciences are not exact but probabilistic and harder to reproduce (e.g. due
to changes in society), and their applications to social policies are more nuanced than the vertical
integration of natural sciences into technology.

Often, claimed causal effects in the Social Sciences are just statistical artifacts. Even
meta-analyses are biased by so-called ‘publication bias’ (Nissen et al., 2016). Indeed, it has been
empirically demonstrated that non-significant estimates are less likely to be published in scientific
venues (van Zwet and Cator, 2021). Prof. Breznau’s research group provided the same dataset
to 73 independent teams of quantitative social scientists, for a total of 161 people, and asked
them to estimate the effect of immigration rates on public support for a welfare-oriented political
agenda. This survey yielded a sample of n > 1,200 estimates of the effect.
Of the estimates, 25% were significantly negative, 17% significantly positive, and in 57.7%
of cases the specified model failed to reject the null hypothesis (Breznau et al., 2022).
Strikingly, based on this result, not only is it almost impossible to claim that a general effect
exists, but it is also impossible to fully deny it, because it is always possible to assert that an
effect holds under specific conditions.

The U.S. Defense Advanced Research Projects Agency (DARPA) recognised the problem
with traditional approaches to Meta-Analysis and Causal Inference and launched the
Systematizing Confidence in Open Research and Evidence (SCORE) Project to understand how to
predict whether a study is likely to fail to replicate. Preliminary findings have not been rosy: with
the exception of Economics, social scientists believe that their own fields produce more
non-replicable claims than replicable ones, i.e. more false discoveries than true ones. Economics seems
to suffer from overconfidence in itself (Gordon et al., 2020). These results came after a large study
led by Brian Nosek that attempted to replicate 100 claims in Psychology journals: less than
half passed a replication attempt (Open Science Collaboration, 2015). Journals with
high bibliometric scores do not perform better than other sources: the evidence points to
zero or negative correlation between bibliometric performance (e.g. journal impact factor)
and replication rates (Szucs and Ioannidis, 2017; Brembs, 2018; Camerer et al., 2018).


2 Misinformation and disinformation 


Ioannidis (2005) summarised predictors of low replication rates: small sample sizes, small
effect sizes, and more than one hypothesis being tested on the same sample. On top of this,
he stresses the incentives to pursue novel findings rather than replication studies. He
claims that papers on new theories are always more cited than their replication attempts, even
when replication is not attained. This is a case of misinformation: inaccurate claims spread
more than their corrections. Disinformation is a distinct phenomenon, where false claims are
justified through a process of fabrication (West and Bergstrom, 2021). It is not necessary to
report fake data to fabricate a fake result. The insidious alternative is to omit observed results.
This behaviour is called “hacking the science” in the scientific community, by analogy with the
method of brute-forcing many random combinations of inputs until a single desired outcome
is achieved by chance, e.g. hacking a password (Imbens, 2021).


2.1 Misinformation: is the Dunning-Kruger effect a statistical artifact?


It is commonly observed that the correlation between performance and the error in the
self-assessment of performance is significantly negative. Since performance depends on skill, the
theory of the Dunning-Kruger Effect (Kruger and Dunning, 1999), or DK, explains this correlation
through the claim that unskilled people have a tendency to overestimate their own skills. The
original study, with more than 8,000 citations, is foundational for modern Pedagogy. A competitor
to DK is the “better than average” theory (Krueger and Mueller, 2002), or BTA. It claims that all people
have a tendency to self-assess their skills above the average, independently of their skill. These
two theories can coexist, but if BTA is true, then the DK effect is overestimated.

Consider the conservative case of two actors: one with a true skill score $x_1 = 40$ and the
other with a true skill score $x_2 = 60$. Their average is $\bar{x} = 50$. Assume the claim of BTA: actor
1 and actor 2 have exactly the same model of assessment of self-score: they adopt the average
plus an expected positive error $\epsilon^{+}$. In this case, it holds


$$|x_1 - (\bar{x} + \epsilon^{+})| > |x_2 - (\bar{x} + \epsilon^{+})| \quad \forall \epsilon^{+} \tag{1}$$


where $|x - (\bar{x} + \epsilon^{+})|$ is the absolute error between true skill and self-assessed skill. It
follows that, even with absolutely no cognitive differences between classes of actors (i.e. $\epsilon^{+}$
is identical across actors), the less skilled actor has a larger absolute deviation. In this case,
even if DK is not true, the parameter $\epsilon^{+}$ would induce a negative correlation. With
a few generalisations it can be shown that any model that parameterises the self-assessed score as
$\bar{x} + \epsilon^{+}, \forall X : \{x_1, x_2, x_3, ..., x_n\}$ would lead to an artificial DK effect, even when DK is not
true. The effect would hold even for normally distributed positive $\epsilon^{+}$.
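
The argument above can be checked numerically. The following is a minimal sketch, not part of the original studies: the function names, the illustrative error of 5 points, and the simulated population (uniform skills, absolute values of Gaussian draws as positive errors) are all assumptions made for illustration. Skills and errors are generated independently, so no DK mechanism is present, yet the correlation between skill and overestimation is strongly negative.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def bta_overestimation(skills, errors):
    """Self-assessed minus true score when every actor reports mean + error."""
    mean_skill = sum(skills) / len(skills)
    return [(mean_skill + e) - x for x, e in zip(skills, errors)]


# The two-actor case from the text, with an illustrative error of 5 points.
two_actors = bta_overestimation([40, 60], [5, 5])  # [15.0, -5.0]

# A larger hypothetical population: skills and errors are drawn independently,
# so there is no DK mechanism in the data-generating process.
rng = random.Random(0)
skills = [rng.uniform(0, 100) for _ in range(1000)]
errors = [abs(rng.gauss(5, 2)) for _ in range(1000)]  # assumed positive errors
r = pearson(skills, bta_overestimation(skills, errors))
print(two_actors, round(r, 3))
```

In the two-actor case the less skilled actor is off by 15 points and the more skilled one by 5, which is the inequality in (1) for this particular error value.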

A meta-analytical study that adopted advanced statistical techniques found that, given the
observed scores in the literature, DK is likely to be a statistical artifact due to BTA (Gignac and
Zajenkowski, 2020). Another study reports only partial support for a true DK effect while
confirming BTA (Jansen et al., 2021). Here no information has been concealed or fabricated. The
authors did not adopt any questionable research practices. They lacked the correct specification
of their null model.


2.2 Disinformation: six degrees of separation and even more 


The expression “small world” refers to a network where a part of the connections forms
with a uniform probability, and another part forms with a higher probability so as to create triadic
closures (fully connected triangles of nodes). As an emergent property, small-world networks have
a “characteristic average path length” L: for any given node in the network, any other node can
be reached by crossing paths with an expected length equal to L, independently of the
number of nodes in the network.
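
To make the notion of an average path length concrete, the sketch below builds a network with one common construction, the Watts-Strogatz procedure (a ring lattice whose edges are randomly rewired), and measures L by breadth-first search. It is an illustrative sketch only: the function names and the parameter values (200 nodes, 4 neighbours, rewiring probability 0.1) are assumptions chosen for the example, and the rewiring step is a simplified version of the published procedure.

```python
import random
from collections import deque


def watts_strogatz(n, k, p, rng):
    """Ring lattice of n nodes, each linked to its k nearest neighbours,
    where each edge is rewired to a random new target with probability p."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    for i in range(n):
        for j in range(1, k // 2 + 1):
            old = (i + j) % n
            if rng.random() < p and old in adj[i]:
                candidates = [v for v in range(n) if v != i and v not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(old)
                    adj[old].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs of nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(1)
lattice_L = average_path_length(watts_strogatz(200, 4, 0.0, rng))
rewired_L = average_path_length(watts_strogatz(200, 4, 0.1, rng))
print(round(lattice_L, 2), round(rewired_L, 2))
```

With no rewiring the lattice has a long average path; a small share of random shortcuts shortens it considerably while most local triangles survive.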

The formation and structure of small-world networks have been described in the Watts-Strogatz
model (Watts and Strogatz, 1998), but the description of this network goes back to Milgram
(1967). Indeed, the implicit claim of Milgram is that in modern (pre-Internet) societies there is
a characteristic path length L between human connections and that L is relatively short.
Curiously, the paper with the experiment that originated the catchphrase “six degrees of separation”
(Travers and Milgram, 1969) was published only 2 years after a theoretical paper (Milgram,
1967) claiming the emergence of L in human societies. Together, the two papers have collected more
than 13,000 citations and, a rare case for a social science theory, they inspired new ideas not
only in business (marketing, etc.) but also in engineering (transport, etc.).

It was a surprise for Judith Kleinfield (2002) to discover that the paper presenting the actual
report of the in vivo experiment of the theory (Travers and Milgram, 1969) is actually poor in
terms of statistical results. A total of 296 participants were recruited for the study. Their task was
to send a document to one of their pre-existing social ties, with the final aim that this document
would reach a specific male broker in Boston. These 296 participants were sampled across
three populations: non-brokers in Nebraska, brokers in Nebraska, and brokers in Boston.

This stratification would have been helpful if enough documents had reached their final
destination: only 214 of the original participants sent the document and only 64 documents reached
the Boston broker, after $s$ stages. Among these 64, the observed average path length was $l = 5.2$. The
territorial variable was the only statistically significant one. The number 6 (degrees of separation)
is never explicitly mentioned; however, in footnote 4 the authors mention that they adjusted $l$
through an unspecified marginal distribution of probabilities of reaching the final node at
stage $s + 1$ (see parameter $Q_i$). In footnote 4, they claim a confidence interval for L between 5
and 7. Is there sufficient evidence for claiming that L exists? From the sample of non-brokers
from Nebraska, only 18 documents reached the destination, with $l = 5.7$. This result could be
generalised to the U.S. population, but the sample size would be small.

Kleinfield (2002) investigated Milgram’s archives, looking for more. She only found con- 
cerning details: 


• Milgram (1967) mentions a pilot study where a document was received by a woman
in only four days. Kleinfield found the pilot’s report and concluded that Milgram picked
an interesting anecdote but never published more details about the pilot because it was
a failure. Attrition in the pilot was so high as to make the observed statistics meaningless.
$Q_i$ is never mentioned in the pilot.

• Travers and Milgram (1969) tried to alter the attrition rate in two ways: by avoiding the
recruitment of social outcasts and by changing the document from a single piece of paper to a
“passport” in bright colours.

• She found an anonymous manuscript about a third attempt with inconsistent results.


2.3 p-hacking


The first case study falls under the category of ‘misinformation within science’ because it
regards how the reputation of theories spreads within science even when a new model has been
proven more consistent. The second case study is different: researchers concealed results from
their own research because these were inconclusive with respect to their hypothesis. This is related to
so-called p-hacking of the level of significance $\alpha$ for rejection of the null hypothesis
in statistical testing. p-hacking is fraudulent because it omits the number of tests
attempted before reaching a statistically significant result in data analysis (Simmons et al., 2011;
Head et al., 2015). p-hacking is typically done in two ways:


1. Parallel p-hacking: many tests are arranged on different samples of the same population.
Each sample has a minimal size but it is large enough to be deemed credible by the typical
reader. Once a positive outcome is seen, no further test is necessary. In the reported result
of the study, the number of tested samples is omitted and only the one associated with
$p < \alpha$ is reported. As a reference: if the effect size is equal to 0 and
the null hypothesis of the test is true, then with $\alpha = .05$, after 14 tests (Bernoulli trials with
parameter $\alpha$), the probability of seeing $p < \alpha$ in at least one test is


$$\sum_{k=1}^{14} \alpha (1 - \alpha)^{k-1} > .51 \tag{2}$$


following the geometric distribution of the Bernoulli trials¹.

2. Sequential p-hacking: a multivariate dataset is collected and a hypothesis is formalised
with a simple model. If the statistics of the model are not significant, then the specification
of the model is trivially adjusted (e.g., control variables are added to the model, outliers
are removed, data is pre-processed differently, etc.) until a random $p < \alpha$ is achieved.
None of these operations is reported. This is a fraudulent type of Hypothesising After
Results are Known, or HARKing (Rubin, 2017).
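
The probability quoted for parallel p-hacking can be verified with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the simulation assumes that p-values are uniformly distributed on [0, 1] when the null hypothesis is true, which holds for tests with a continuous test statistic. It computes the geometric sum, its closed form $1 - (1 - \alpha)^{14}$, and a Monte Carlo estimate of the same quantity.

```python
import random


def prob_false_positive_sum(alpha, n_tests):
    """Sum of the geometric probabilities alpha * (1 - alpha)**(k - 1)."""
    return sum(alpha * (1 - alpha) ** (k - 1) for k in range(1, n_tests + 1))


def prob_false_positive_closed(alpha, n_tests):
    """Closed form: one minus the probability that every test is negative."""
    return 1 - (1 - alpha) ** n_tests


def simulate_parallel_p_hacking(alpha, n_tests, n_studies, rng):
    """Share of simulated studies in which at least one of n_tests null tests
    gives p < alpha. Assumes p-values are uniform on [0, 1] under the null."""
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


exact = prob_false_positive_sum(0.05, 14)        # about 0.5123
closed = prob_false_positive_closed(0.05, 14)    # same value, closed form
simulated = simulate_parallel_p_hacking(0.05, 14, 20000, random.Random(0))
print(round(exact, 4), round(closed, 4), round(simulated, 3))
```

In other words, a researcher who silently runs 14 tests on a null effect is more likely than not to obtain at least one publishable result.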


3 Remedies: pre-registration and Multiverse Analysis 


A possible remedy for science hacking is pre-registration, that is, recording in a dedicated
electronic archive an anonymous manuscript that details all the research questions and the methods
of upcoming research. This happens before the data collection, so during peer review authors can
certify that their analysis is coherent with the original research design and that hypotheses were
not drawn after knowing the sample statistics (Nosek et al., 2018). Pre-registration has two
problems: (i) nothing prevents p-hacking a result, pre-registering its specification, then
submitting the complete manuscript for peer review (Yamada, 2018); (ii) it does not allow
serendipitous discoveries incoherent with what is pre-registered (Simmons et al., 2021).

Looking back at the crowd-sourced estimation in Breznau et al. (2022), this approach is 
kindred to a meta-analytical paradigm called Multiverse Analysis: Gelman and Loken (2014) 
popularised the assumption that the robustness of a scientific model can be estimated through 
trivially altering its specification. They call “degrees of freedom of the researcher” the analytical 
choices in data analysis, e.g. the choice of a link function in binomial regression between logit 
and probit. Steegen et al. (2016) introduced the concept of the “multiverse” of a scientific claim. 
These degrees of freedom are the source of errors in estimation. 

¹ The equivalent command in the R language is pgeom(13, .05).

In particular, claims are formalised into models. Assuming that a true parameter $\theta$ of the
model exists, then, given a dataset, there exists a set $\Theta_J = \{\hat{\theta}_j\}$ of estimates from
different specifications $j$ of the model such that each estimate $\hat{\theta}_j$ is sufficiently close to
$\theta$ and $E(\hat{\theta}_j) = \theta$ holds. How can one draw
a sample that is representative of $\Theta_J$ in order to ascertain the uncertainty associated with the
error of misspecification (model error)? Crowd-sourced estimation (Breznau et al., 2022) draws
a random sample of specifications and estimates just by surveying experts. Instead, Multiverse
Analysis draws a systematic (not random) sample $\hat{J}$ of specifications by mapping all the
degrees of freedom of the researcher (e.g. inclusion/exclusion of control variables, operations
in data pre-processing, modelling choices for overdispersion, etc.) and combining them into $\hat{J}$,
that is the multiversal sample of specifications or just the “multiverse”.
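
The mapping of researcher degrees of freedom into a multiverse can be sketched as a Cartesian product of analytical choices. The example below is a minimal illustration under stated assumptions, not a reference implementation: the two degrees of freedom (an outlier cutoff and the inclusion of a control variable `z`), the synthetic data with a true effect of 0.3, and all function names are hypothetical. The control is handled by residualising both variables on `z`, which gives the same slope as a regression that includes `z`.

```python
import itertools
import random
import statistics


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def residualise(v, z):
    """Residuals of v after regressing it on the control z."""
    slope = ols_slope(z, v)
    mv, mz = statistics.fmean(v), statistics.fmean(z)
    return [a - mv - slope * (c - mz) for a, c in zip(v, z)]


def estimate(data, spec):
    """Estimate the effect of x on y under one specification."""
    x, y, z = data["x"], data["y"], data["z"]
    if spec["outlier_cutoff"] is not None:
        my, sy = statistics.fmean(y), statistics.pstdev(y)
        keep = [i for i, v in enumerate(y)
                if abs(v - my) <= spec["outlier_cutoff"] * sy]
        x, y, z = [[s[i] for i in keep] for s in (x, y, z)]
    if spec["control_z"]:
        x, y = residualise(x, z), residualise(y, z)
    return ols_slope(x, y)


# Hypothetical degrees of freedom; every combination is one specification.
degrees_of_freedom = {
    "outlier_cutoff": [None, 2.0, 3.0],
    "control_z": [False, True],
}
multiverse = [dict(zip(degrees_of_freedom, combo))
              for combo in itertools.product(*degrees_of_freedom.values())]

# Synthetic data: z confounds x and y, and the true effect of x on y is 0.3.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [0.5 * c + rng.gauss(0, 1) for c in z]
y = [0.3 * a + 0.8 * c + rng.gauss(0, 1) for a, c in zip(x, z)]
data = {"x": x, "y": y, "z": z}

estimates = [estimate(data, spec) for spec in multiverse]
spread = statistics.stdev(estimates)  # variability across specifications
for spec, est in zip(multiverse, estimates):
    print(spec, round(est, 3))
```

The spread of the estimates across the six specifications is the kind of model uncertainty that a single reported specification would hide.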

Multiverse Analysis assumes that measures of variability in the observed multiversal
estimates $\hat{\theta}_{j \in \hat{J}}$ are as informative as, if not more informative than, parametric or
bootstrapped standard errors or confidence intervals about the uncertainty involved in the
estimation of $\theta$ (Young and Holsteen, 2017; Simonsohn et al., 2020). An interesting application
of Multiverse Analysis is checking for the Janus effect (Patel et al., 2015), which occurs when
statistically significant $\hat{\theta}_j$ with different signs co-exist in the same multiverse. The Janus
effect is a red flag, in the sample, of the so-called parametric type S error (Gelman and Tuerlinckx, 2000).
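
A check for the Janus effect can be written in a few lines. The sketch below is an illustration under assumptions: the function name is hypothetical, the estimates and standard errors are made-up values, and significance is judged with a normal approximation (|estimate / standard error| > 1.96 for a two-sided 5% level), which may differ from the test used in a given study.

```python
def janus_effect(estimates, std_errors, z_crit=1.96):
    """True if the multiverse contains significant estimates of both signs.

    Significance uses a normal approximation: |estimate / se| > z_crit.
    """
    significant_signs = set()
    for estimate, se in zip(estimates, std_errors):
        if abs(estimate / se) > z_crit:
            significant_signs.add(1 if estimate > 0 else -1)
    return significant_signs == {1, -1}


# Hypothetical multiverse with one significantly positive and one
# significantly negative estimate: the Janus effect is present.
with_janus = janus_effect([0.4, -0.5, 0.1], [0.1, 0.1, 0.2])

# Here the only negative estimate is not significant: no Janus effect.
without_janus = janus_effect([0.4, 0.5, -0.1], [0.1, 0.1, 0.2])
print(with_janus, without_janus)
```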